
pipelines: backfill ingest_mode and auth_type on transform_ocsf/ #61

Merged
nate-smalls-s1 merged 3 commits into Sentinel-One:main from natesmalley:transform-ocsf-backfill on Apr 27, 2026

Conversation

@natesmalley
Contributor

Summary

Follow-up to #59 (merged). Backfills the ingest_mode and auth_type fields onto every existing pipelines/community/transform_ocsf/*/metadata.yaml, so the new schema applies uniformly across the directory rather than only to entries added after #59.

What changed

  • 129 metadata.yaml files modified. Each one gains two new lines inserted immediately after the existing ingestion_method: line:
      ingest_mode: "..."
      auth_type: "..."
  • No serializer logic, no pipeline JSON, no other metadata changed. The diff is exactly +258 lines / 0 deletions across 129 files (plus one CHANGELOG entry).
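The insertion described above can be sketched as a small text transform. This is only an illustration, not the script used for this PR: the function name `backfill_fields` is hypothetical, and the sample file layout in the usage note is an assumption about what a metadata.yaml might look like.

```python
def backfill_fields(text: str, ingest_mode: str, auth_type: str) -> str:
    """Insert ingest_mode and auth_type right after the ingestion_method line.

    Works on raw text rather than a YAML round-trip so that comments, key
    order, and quoting elsewhere in the file are left untouched.
    """
    out = []
    inserted = False
    for line in text.splitlines(keepends=True):
        if not line.endswith("\n"):
            line += "\n"  # normalise a missing trailing newline
        out.append(line)
        stripped = line.lstrip()
        if not inserted and stripped.startswith("ingestion_method:"):
            # Reuse the indentation of the anchor line for the new keys.
            indent = line[: len(line) - len(stripped)]
            out.append(f'{indent}ingest_mode: "{ingest_mode}"\n')
            out.append(f'{indent}auth_type: "{auth_type}"\n')
            inserted = True
    if not inserted:
        raise ValueError("no ingestion_method line found")
    return "".join(out)


# Hypothetical file content, for illustration only.
sample = (
    "metadata_details:\n"
    '  purpose: "example"\n'
    '  ingestion_method: "streaming"\n'
    '  version: "1.0"\n'
)
result = backfill_fields(sample, "Syslog", "N/A")
```

Because the function only appends lines, a file processed this way would show up in the diff as +2 -0, which matches the shape of the change described here.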

How values were derived

For each transform, ingest_mode is determined by combining two signals:

  1. Bound parser metadata (parsers/community/<source_name>/metadata.yaml) when authoritative. If the parser declares format: syslog | CEF | RFC5424 | custom syslog | w3c or ingestion_method containing Syslog or HEC, the parser's declaration wins.

  2. Vendor and product knowledge for the ~90 entries where the parser metadata is unclear (format: gron with ingestion_method: streaming or unknown, or no parser binding at all). Examples of the patterns applied:

    • Cisco network kit (firewalls, ASA, Meraki, ISE, Umbrella, etc.) → Syslog
    • Microsoft 365 / Entra / Defender management surfaces → API Call with OAuth
    • AWS managed services delivering to S3 (CloudTrail, ELB, Route 53 Resolver, GuardDuty export, VPC flow) → Other - {object store with SQS notifications} with IAM Role
    • Azure Event Hub-delivered streams (signin, defender email) → Other - {Azure Event Hub stream (AMQP/Kafka protocol)} with OAuth
    • SaaS REST APIs (Okta, Snyk, Wiz, Tenable, Mimecast, Netskope, Proofpoint, GitHub, Google Workspace, Cloudflare, etc.) → API Call with the vendor's typical auth (Bearer Token, API Key & Secret, or OAuth)

auth_type reflects the upstream collector's typical auth pattern (N/A for syslog, IAM Role for AWS object stores, OAuth for Microsoft/Google APIs, API Key & Secret / Bearer Token for vendor REST APIs, etc.). Where the vendor supports multiple ingestion modes, the most common deployment pattern is recorded.
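The first signal, whether the bound parser's declaration is treated as authoritative, can be sketched as a predicate. This is a minimal sketch assuming the parser metadata has already been loaded into a dict with `format` and `ingestion_method` keys; the function name and the exact matching rules (case-insensitive, whole-word match on Syslog/HEC) are assumptions, not the logic actually used for the backfill.

```python
import re
from typing import Optional

# Formats listed in the description as making the parser authoritative.
AUTHORITATIVE_FORMATS = {"syslog", "cef", "rfc5424", "custom syslog", "w3c"}

# Whole-word match, so that e.g. "checkpoint" does not match "hec".
METHOD_RE = re.compile(r"\b(syslog|hec)\b", re.IGNORECASE)


def parser_is_authoritative(parser_meta: Optional[dict]) -> bool:
    """Return True when the parser declaration should win (signal 1).

    Returns False when there is no parser binding or the declaration is
    unclear, in which case vendor and product knowledge (signal 2) applies.
    """
    if not parser_meta:
        return False
    fmt = str(parser_meta.get("format", "")).strip().lower()
    method = str(parser_meta.get("ingestion_method", ""))
    return fmt in AUTHORITATIVE_FORMATS or bool(METHOD_RE.search(method))
```

Entries for which this returns False correspond to the ~90 cases above that fall back to vendor and product knowledge.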

Per-entry confidence is captured in the staging file .reorg-prep/inventory/transform_ocsf_classifications.tsv (untracked) as one of high (103), medium (17), or low (9). The low entries are genuinely generic placeholders (json_generic_logs, sample_test_logs, microservice_tracing_logs, mail_server_logs, etc.) where a more specific value is not derivable; they use Other - {Explain: ...} with the reason inline.

Resulting distributions

ingest_mode:

  Value                                                       Count
  Syslog                                                         56
  API Call                                                       39
  Other - {object store / Event Hub / agent / Kafka / etc.}      34

auth_type:

  Value                                                       Count
  N/A (syslog / file-based / generic)                            75
  API Key & Secret                                               20
  OAuth                                                          18
  IAM Role                                                        8
  Bearer Token                                                    7
  Other (Kafka SASL)                                              1
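As a quick consistency check, each breakdown in this description should account for all 129 files, and two added lines per file should give the +258 figure. The snippet below just restates the numbers from this description; the variable names are illustrative.

```python
TOTAL_FILES = 129

ingest_mode = {"Syslog": 56, "API Call": 39, "Other": 34}
auth_type = {
    "N/A": 75,
    "API Key & Secret": 20,
    "OAuth": 18,
    "IAM Role": 8,
    "Bearer Token": 7,
    "Other (Kafka SASL)": 1,
}
confidence = {"high": 103, "medium": 17, "low": 9}


def totals_match(counts: dict, expected: int = TOTAL_FILES) -> bool:
    """True when a breakdown accounts for every modified file exactly once."""
    return sum(counts.values()) == expected


# Two new lines (ingest_mode, auth_type) per metadata.yaml.
added_lines = 2 * TOTAL_FILES
```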

What is NOT in this PR (intentional)

  • palo_alto_networks_firewall/ is not modified because it is being removed in #60 ("pipelines: drop F-graded PAN-OS firewall transform; document PAN-OS variants"), which is still open. Both PRs apply cleanly regardless of merge order.
  • No directory renames or relocations. Migration of transforms into the push/pull/ structure (and any rename for naming consistency) is a separate follow-up PR.
  • No serializer logic changes.
  • No pipeline JSON changes. The bound source_name (parser binding) is unchanged for every entry.

Test plan

  • CI passes (CodeQL, secret scanning, contributor automation)
  • git diff --stat shows exactly 129 metadata.yaml files in pipelines/community/transform_ocsf/, each +2 -0, plus one CHANGELOG entry
  • No file outside pipelines/community/transform_ocsf/*/metadata.yaml and CHANGELOG.md was modified
  • Spot-check 5 representative entries on github.com to confirm YAML validity and the new fields render correctly:
    • paloalto_logs (Syslog / N/A)
    • okta (API Call / API Key & Secret) — orphan-bound entry
    • aws_cloudtrail (Other - object store / IAM Role) — orphan-bound entry
    • microsoft_eventhub_azure_signin_logs (Other - Event Hub / OAuth)
    • cisco_duo (Syslog / N/A) — parser-bound entry
  • Existing grade: blocks at the top of each metadata.yaml are unchanged (the automated grader signal is preserved)
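The diff-shape items in the test plan could be checked mechanically rather than by eye. The sketch below parses the tab-separated output of `git diff --numstat` (added, deleted, path) and reports anything that is not a +2 -0 metadata.yaml change or the CHANGELOG entry. The function name is hypothetical, and this is not a check that ran in CI for this PR.

```python
import re

# One directory level under transform_ocsf/, then metadata.yaml.
METADATA_RE = re.compile(
    r"^pipelines/community/transform_ocsf/[^/]+/metadata\.yaml$"
)


def check_numstat(numstat: str) -> list:
    """Return a list of problems found in `git diff --numstat` output."""
    problems = []
    for row in numstat.strip().splitlines():
        added, deleted, path = row.split("\t", 2)
        if path == "CHANGELOG.md":
            continue  # the single CHANGELOG entry is expected
        if not METADATA_RE.match(path):
            problems.append(f"unexpected file: {path}")
        elif (added, deleted) != ("2", "0"):
            problems.append(f"{path}: +{added} -{deleted}, expected +2 -0")
    return problems
```

One way to use it would be to pipe in the output of `git diff --numstat` between the base branch and the PR branch and fail if the returned list is non-empty; the exact revision range depends on the local checkout.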

Nate Smalley and others added 2 commits April 26, 2026 20:50
Adds the new metadata fields introduced by Sentinel-One#59 to all 129 existing
transform_ocsf/ pipeline metadata.yaml files. The fields are inserted
immediately after the existing ingestion_method line in each file. No
serializer logic, no pipeline JSON, no other metadata changed.

Values were derived per entry by combining:

1. Bound parser metadata (parsers/community/<source_name>/metadata.yaml)
   when the parser declares format=syslog/CEF/RFC/w3c/custom-syslog or
   ingestion_method containing "Syslog" or "HEC" -- the parser is
   authoritative when its declaration is unambiguous.

2. Vendor and product knowledge for the ~90 entries where the parser
   metadata is unclear (gron format with "streaming" or "unknown"
   ingestion_method, or no parser binding at all). Examples:
   - Cisco network kit (firewalls, ASA, Meraki, ISE, etc.) -> Syslog
   - Microsoft 365 / Entra / Defender management surfaces -> API Call (OAuth)
   - AWS managed services delivering to S3 (CloudTrail, ELB, Route53
     Resolver, GuardDuty export, VPC flow) -> Other - {object store with
     SQS notifications} (IAM Role)
   - Azure Event Hub-delivered streams (signin, defender email) ->
     Other - {Azure Event Hub stream (AMQP/Kafka protocol)} (OAuth)
   - SaaS REST APIs (Okta, Snyk, Wiz, Tenable, Mimecast, Netskope,
     Proofpoint, GitHub, Google Workspace, Cloudflare, etc.) -> API Call
     with the vendor's typical auth (Bearer Token, API Key & Secret,
     or OAuth)

Confidence per entry is recorded in
.reorg-prep/inventory/transform_ocsf_classifications.tsv as one of
high (103), medium (17), or low (9). Low-confidence entries are
genuinely generic placeholders (json_generic_logs, sample_test_logs,
microservice_tracing_logs, etc.) where a more specific value is not
derivable; they use Other - {Explain: ...} with the reason inline.

palo_alto_networks_firewall/ is intentionally not modified because it is
being removed in PR Sentinel-One#60 (open).

Resulting distribution:
  Syslog                                              56
  API Call                                            39
  Other - {object store / Event Hub / agent / etc.}   34

Auth distribution:
  N/A (syslog / file-based / generic)                 75
  API Key & Secret                                    20
  OAuth                                               18
  IAM Role                                             8
  Bearer Token                                         7
  Other (Kafka SASL)                                   1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nate-smalls-s1 merged commit af61b53 into Sentinel-One:main on Apr 27, 2026
1 check passed
nate-smalls-s1 pushed a commit that referenced this pull request Apr 27, 2026
Moves 91 community pipeline directories from
pipelines/community/transform_ocsf/<name>/ into the ingest-mode-first
taxonomy introduced in #59:

  pipelines/push/syslog/<vendor>/<product>/      57 entries
  pipelines/pull/api/<vendor>/<product>/         29 entries
  pipelines/pull/object_store/<vendor>/<product>/  5 entries

The mode bucket is determined by each entry's ingest_mode field (backfilled
in #61). The vendor and product split is derived per entry from the
upstream parser binding and vendor/product convention; collisions across
the cluster (Cisco Meraki, Fortinet, Cloudflare, Zscaler, Microsoft, etc.)
are disambiguated with explicit product-name overrides documented in
.reorg-prep/inventory/transform_ocsf_migration_plan.tsv.

History is preserved on every entry (git mv).

What stays in pipelines/community/transform_ocsf/ (15 entries):
  - Generic / template / unknown-vendor entries: agent_metrics_logs,
    generic_access_logs, inngate_gateway_logs, json_generic_logs,
    json_nested_kv_logs, leef_template_logs, log4shell_detection_logs,
    mail_server_logs, microservice_tracing_logs, sample_test_logs,
    spam_detection_logs, sql_database_logs, syslog_space_delimited_logs,
    vpc_logs, jruby_application_logs.

What is NOT in this PR (intentional):
  - 23 entries scheduled for removal in #62 (broken-legacy, 7) and #63
    (first-party ingestion paths, 16) are NOT moved; they remain in
    transform_ocsf/ until those PRs merge. This PR has no overlap or
    conflict with #62/#63 -- merge order does not matter.
  - No serializer logic, no metadata.yaml content, and no pipeline JSON
    content was modified. Every change is a directory rename.
  - No naming-consistency cleanup (e.g., paloalto_* -> palo_alto/*) is
    applied yet; that is a separate follow-up.

The pipelines/push/{syslog,hec}/ and pipelines/pull/{api,object_store}/
directories are now populated -- the empty scaffolding from #59 finally
has content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
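The mode-bucket step in the commit message above can be sketched as a lookup from ingest_mode to a target directory. This is a rough illustration only: the substring test for object-store values, the `None` return for entries that stay put, and the vendor/product names in the usage note are all assumptions. The actual per-entry plan lives in the migration TSV mentioned in the commit and includes manual overrides this sketch does not model.

```python
from pathlib import PurePosixPath
from typing import Optional

# Buckets named in the commit message, keyed by exact ingest_mode value.
BUCKETS = {
    "Syslog": "push/syslog",
    "API Call": "pull/api",
}


def target_dir(ingest_mode: str, vendor: str, product: str) -> Optional[str]:
    """Map an entry to its ingest-mode-first directory, or None if it stays."""
    if ingest_mode in BUCKETS:
        bucket = BUCKETS[ingest_mode]
    elif ingest_mode.startswith("Other") and "object store" in ingest_mode:
        # Assumption: "Other - {object store ...}" values map to pull/object_store.
        bucket = "pull/object_store"
    else:
        return None  # no matching bucket; entry is left where it is
    return str(PurePosixPath("pipelines") / bucket / vendor / product)
```

For example, `target_dir("Syslog", "cisco", "asa")` gives `pipelines/push/syslog/cisco/asa`, where the vendor and product names are made up for the example.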